DOCUMENT RESUME

ED 047 319                                              AL 002 762

AUTHOR        Marlin, Marjorie; Barron, Nancy
TITLE         Methodology and Implications of Reconstruction and
              Automatic Processing of Natural Language of the
              Classroom.
INSTITUTION   Missouri Univ., Columbia. Center for Research in
              Social Behavior.
SPONS AGENCY  National Science Foundation, Washington, D.C.
PUB DATE      Feb 71
NOTE          16p.; Paper presented at the Annual Meeting of the
              American Educational Research Association, New York,
              N.Y., February 1971

EDRS PRICE    EDRS Price MF-$0.65 HC-$3.29
DESCRIPTORS   *Classroom Communication, Computational Linguistics,
              *Computer Programs, *Data Processing, Grammar,
              *Research Methodology, Sociolinguistics, *Structural
              Analysis



ABSTRACT 

This paper discusses in some detail the procedural
areas of reconstruction and automatic processing used by the
Classroom Interaction Project of the University of Missouri's Center
for Research in Social Behavior in the analysis of classroom
language. First discussed is the process of reconstruction, here
defined as the "process of adding to messages what is otherwise not
directly observable in the overt communicating behavior (i.e.,
grammatically and contextually implicit information), and of
structuring this behavior into simplex sentences and lexical units."
This is followed by a consideration of the role of the computer in
processing the reconstructed data and a discussion of the computer
programs used. The authors believe that the language analysis system
explained here has the advantage of universality, being applicable
not only to classroom discourse but to any corpus of linguistic
communication as well. See related documents AL 002 750-753. (FWB)










METHODOLOGY AND IMPLICATIONS OF RECONSTRUCTION AND AUTOMATIC 
PROCESSING OF NATURAL LANGUAGE OF THE CLASSROOM* 



by 



Marjorie Marlin 
Nancy Barron 







U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
OFFICE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE
OF EDUCATION POSITION OR POLICY.






*The research leading to the results reported here has been
supported by funds from the National Science Foundation, Grant No.
GS3232, and by Institutional Research funds to the Center for Research
in Social Behavior at the University of Missouri, Columbia, Missouri.






Our method of analyzing classroom language includes two
procedural areas, reconstruction and automatic processing, which contain
unexplored implications for language study. The propositions underlying
the reconstruction system and the automatic processing algorithms will
be discussed more or less separately although they are in fact
interrelated.

The process of adding to messages what is otherwise not
directly observable in the overt communicating behavior (i.e.,
grammatically and contextually implicit information), and of structuring
this behavior into simplex sentences and lexical units, is referred
to as reconstruction. In this definition, simplex sentences and
lexical units are deemed the units of our analysis. Furthermore, our
reconstruction system rests on the premises that implicit information
is included in a communication, and that for that communication to be
properly analyzable this implicit information must be extricated
from speech.

Roughly speaking, a simplex sentence is a clause, and the
notion of reconstructing a text is analogous to that of parsing a
sentence into its constituent clauses. Naturally occurring sentences
may vary from a simple form in which we find a single noun and
intransitive verb to a complex form in which multiple subjects,
objects, or verbs appear, with adjectival modifiers supplying
additional meaning. In order to provide a simple format for analyzing
meaning, it is useful to break down the natural sentence into a series
of simple propositions that represent its meaning. The single verb
propositions are simplex sentences. A lexical unit is a segment of
reconstructed text which is a nominal phrase, a verb phrase, a link,
a nominalized sentence slot, or uncoded material. We have considered
these the irreducible units of lexical analysis.

In effect, we have provided our own definitions for parts of
speech. That is, we have asserted that any conversation can be
represented by such lexical units as: a) nominals (including adjectives,
articles, prepositions as well as nouns and pronouns), b) verbals
(including adverbs as well as the main verb and its auxiliary
structure), c) links (including most conjunctive particles and traditional
connectors), d) nominalized sentence slots (which, as embedded sentences,
fulfill the function of a phrase in a larger sentence unit), and e)
uncoded materials, such as uh-uh, yes, well, good, etc.

In reconstruction, each lexical unit and each simplex sentence
must be appropriately designated. This allows lexical units to be
assigned to simplex sentences, and simplex sentences to be placed in
order. To designate units, we adopted a scheme that made such
identification precede the unit designated. Data were processed in sequential
strings. One lexical unit was "ended" by the occurrence of the next
identification number. This scheme demanded that all text "belong" to
an identification number. That is, every word in the reconstructed
text had to be assigned to its appropriate lexical unit number. Each
naturally occurring sentence, each simplex sentence within it, and
each lexical unit had to be continuous. Intervening structure within
the simplex sentence was recorded by means of special conventions
involving arbitrarily chosen punctuation marks.
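
The designation scheme can be illustrated with a minimal Python sketch.
The paper does not give the actual layout of the identification
numbers, so the form "natural.simplex.unit/" used here, and the names
LexicalUnit, split_units, and group_by_simplex, are assumptions made
only for illustration. The sketch shows the two properties stated
above: an identifier precedes its unit, and a unit ends where the next
identifier begins, so no text may fall outside a unit.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical identifier layout "<natural>.<simplex>.<unit>/".
# The paper does not specify the real one.
ID_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)/")


@dataclass
class LexicalUnit:
    natural: int   # number of the naturally occurring sentence
    simplex: int   # number of the simplex sentence within it
    unit: int      # number of the lexical unit within the simplex
    text: str


def split_units(stream: str) -> list[LexicalUnit]:
    """Split a sequential string into units; each unit runs up to the next identifier."""
    matches = list(ID_PATTERN.finditer(stream))
    first = matches[0].start() if matches else len(stream)
    if stream[:first].strip():
        # Every word must "belong" to an identification number.
        raise ValueError("text found before the first identification number")
    units = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(stream)
        text = stream[m.end():end].strip()
        units.append(LexicalUnit(int(m.group(1)), int(m.group(2)), int(m.group(3)), text))
    return units


def group_by_simplex(units: list[LexicalUnit]) -> dict[tuple[int, int], list[str]]:
    """Assign lexical units to their simplex sentences, keeping input order."""
    grouped: dict[tuple[int, int], list[str]] = {}
    for u in units:
        grouped.setdefault((u.natural, u.simplex), []).append(u.text)
    return grouped
```

Because identification precedes the unit, a single left-to-right pass
is enough; nothing has to be looked up ahead of the current position
except the start of the next identifier.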







Hierarchical claims about language underlie the process of
separating natural sentences into lexical units. Some links are
defined to represent a subordination of one simplex to another, as, for
example, "unless" and "because." To consider clauses as optionally
countable units implies an ordering, and asserts that the larger
natural sentence is supraordinate to the subordinate simplex
structures which underlie it. These assertions are built into the
reconstruction procedures. Of course, the code makes judgments about
complexity of sentence structure derived from a transformational
grammatical theory.

Some implications of the lexical phrase and simplex sentence
units relate to assumptions about the performance and competence of
language users. It is assumed that members of a speech community
share a common set of rules regarding appropriate, permissible
utterances. It is further assumed that underlying these utterances
are concepts that relate in some fashion to those employed by the
grammarian to account for the competence of the speakers, such as
simplex sentences and nominals.

We assume that speakers possess an organized set of concepts,
the content of which is encoded and dispatched to the receivers, who
in turn possess the organized concepts necessary to decode and
understand the message content. Of course, people do not communicate
perfectly; still, the greatest amount of communicative meaning should be
captured if the system is constructed to represent messages as though
they reflected these common conceptual organizations. If such
organizations were not to be found, speech would be intuitive, idiomatic, and
infinitely variant.






Our processing of speech for analysis adhered to the following
steps. The initial transcribing of the videotapes was done by
secretaries who were untrained in linguistics. This process resulted in a
rough transcription. A gross post-editing was performed, then a fine
post-editing, by research assistants with linguistic training who
specialized in ferreting out information; for instance, scarcely
audible material, specific nonverbal information, and features of
the interaction such as the target of the utterance. These editors'
interpretations provide control of veridicality. Trained undergraduates
then reconstructed the text according to the system described in the
manual referenced by Mr. Guyette. The reconstructed text then
underwent a double reconstruction editing, first by a linguistically trained
graduate student and then by the research coordinator (Dr. Barron).
Ambiguities in interpretation were resolved. Judgments involved in
assignment of identification numbers, by lexical unit, simplex sentence,
and natural sentence classification, and by content of the lexical unit
with regard to its form, its referent, etc., were refined.

Reconstructed texts were entered directly into the computer as
a data source. Each lexical unit was reproduced as punches on a
computer card, including the text of the unit itself, the natural sentence
number, simplex sentence number, and unit type (verb, link, etc.). This
information was then reproduced by computer in three separate formats.
First was a list format, in which lexical units appeared separately,
one per line. Next was a straight text format, with simplex sentences
appearing as text, one per line, including the punctuation conventions
we had chosen to indicate such concepts as implicitness or inserted
referents (see Exhibits A and B). Third was an expanded text format, which
featured the addition of type identification before each lexical unit
within the text. Information in this form was then used for the coding
of lexical units and sentences.
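
The three listing formats can be sketched in Python as follows. This is
an illustration under stated assumptions, not the project's program:
the Card fields follow the description above, but the column layout,
the one-letter type labels, the "type:text" form of the expanded
format, and the use of parentheses for implicit material are all
hypothetical.

```python
from itertools import groupby
from typing import NamedTuple


class Card(NamedTuple):
    natural: int     # natural sentence number
    simplex: int     # simplex sentence number
    unit_type: str   # hypothetical labels, e.g. "N" nominal, "V" verbal, "L" link
    text: str        # text of the lexical unit


def _by_simplex(cards):
    # Cards arrive in sequential order, so consecutive grouping is sufficient.
    return groupby(cards, key=lambda c: (c.natural, c.simplex))


def list_format(cards):
    """One lexical unit per line."""
    return [f"{c.natural:03d} {c.simplex:02d} {c.unit_type} {c.text}" for c in cards]


def straight_text(cards):
    """One simplex sentence per line, as running text."""
    return [" ".join(c.text for c in group) for _, group in _by_simplex(cards)]


def expanded_text(cards):
    """One simplex sentence per line, each unit preceded by its type."""
    return [" ".join(f"{c.unit_type}:{c.text}" for c in group)
            for _, group in _by_simplex(cards)]
```

For example, two cards "the teacher" (N) and "asked" (V) in the same
simplex sentence would print as two lines in the list format, as
"the teacher asked" in the straight text format, and as
"N:the teacher V:asked" in the expanded text format.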

The lexicon code concerned judgments about lexical units--words
and phrases--such as their case function, pronominalization, gender, and
implicitness or explicitness. The sentence code concerned judgments
about language at the level of the simplex sentence, such as mode--
question, command, assertion--and structural complexity type--adjoining,
conjoining, and embedding mechanism. A check program was run on the
reconstruction to find format and punctuation errors. Coding
judgments were keypunched and verified. Finally the coding and reconstructed
text were collated, referenced, and stored on magnetic tape for analysis.
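
A check program of the kind mentioned above might look like the
following Python sketch. The paper does not list the errors that were
checked, so the three tests here (identification numbers out of
sequence, unbalanced parentheses, empty units) and the parenthesis
convention itself are assumptions chosen for illustration.

```python
def check_units(units):
    """Return (index, message) pairs for suspect units.

    `units` is a list of (natural, simplex, unit_no, text) tuples in
    the order in which they were entered.
    """
    errors = []
    previous = None
    for i, (natural, simplex, unit_no, text) in enumerate(units):
        key = (natural, simplex, unit_no)
        # Identification numbers are expected to increase strictly.
        if previous is not None and key <= previous:
            errors.append((i, "identification number out of sequence"))
        # Hypothetical punctuation convention: parentheses must pair up.
        if text.count("(") != text.count(")"):
            errors.append((i, "unbalanced parentheses"))
        if not text.strip():
            errors.append((i, "empty unit"))
        previous = key
    return errors
```

Reporting the index of each offending unit, rather than stopping at the
first error, lets a whole deck be corrected in one pass.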

Due to the sheer mass of the material, procedures were devised
whenever possible to handle the process by computer. Utility programs
were written to make corrections, to collate coding cards, and to ensure
that information was reproduced on tape in precise columnar form. We
now have on tape a total of 230 classroom minutes of reconstructed and
coded sentences, comprising approximately 83,000 computer card images.

Some problem areas were painfully uncovered as we went along.
For example, some arbitrary decisions with regard to coding proved to
be less efficient than we had hoped. For instance, we would have
included some convention to indicate the head noun or head verb of a
lexical unit, had we known that such indication would later prove
to be desirable. This information is now retrievable only through
human judgment.







In the main, the restriction of back translatability of the
reconstructed text to the fine-post-edited text has been adhered to.
The exception--interrupted (noncontiguous) simplex sentences--will be
accounted for in the next data processing by adding a new lexical unit
designation. It is intended to computerize the back translation
procedures and get a numerical measure of recoverability by back
translation.
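
One way such a numerical measure could be obtained is sketched below in
Python. The paper describes this step only as an intention, so
everything here is an assumption: that implicit material is marked by
parentheses, that back translation amounts to deleting it, and that
recoverability is scored as the similarity ratio from the standard
library's difflib.SequenceMatcher.

```python
import re
from difflib import SequenceMatcher


def back_translate(simplexes):
    """Rejoin simplex sentences and drop material marked as implicit.

    Assumes a hypothetical convention in which reconstructed (implicit)
    material is enclosed in parentheses.
    """
    text = " ".join(simplexes)
    text = re.sub(r"\([^)]*\)", " ", text)
    return text.split()


def recoverability(simplexes, post_edited):
    """Word-level similarity (0.0 to 1.0) between the back translation
    and the fine-post-edited text."""
    return SequenceMatcher(None, back_translate(simplexes), post_edited.split()).ratio()
```

A score of 1.0 would mean that the post-edited wording is fully
recoverable from the reconstruction; lower scores would point to
segments worth re-examining by hand.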

Difficulty has also been encountered in the sequential
numbering of naturally occurring sentences. For example, if a sentence
is accidentally "lost" in the assigning of identification numbers,
then "found," all subsequent sentences must be renumbered. An expired
classroom time designation might be a desirable substitution for
sequential numbers, given additional equipment.

Now let us consider the computer programs available for
processing reconstructed and coded data. First, programs were developed
for preparing textual listings of data, as has been described above.
Second, retrieval and classification of the data stored on tape
has been done by programming the computer to output the specific
information required. A set of programs has been implemented which produces
frequency counts for specifically requested variables. We have a case
count program which produces a table of frequencies of case use,
cross-classified by "teacher" vs. "pupil" emitter. Another program gives us
cross-classifications of items by gender, for animate cases only,
cross-classified by implicitness vs. explicitness. Data from these programs
have been analyzed and results are presented here by Dr. Barron.
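
A frequency count cross-classified by two coded variables can be
sketched in a few lines of Python. The record layout (a dictionary per
lexical unit) and the case labels in the usage note are hypothetical;
only the idea of a case-by-emitter table comes from the description
above.

```python
from collections import Counter


def cross_tab(records, row_key, col_key):
    """Count records by two coded variables and return a nested table.

    Cells with no observations are reported as 0 rather than omitted.
    """
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}
```

Called as cross_tab(records, "case", "emitter"), it yields one row per
case and one column each for "teacher" and "pupil". The same function
would serve for the gender by implicitness table if the records were
first restricted to animate cases.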

One program produces a count of simplex sentences with respect
to their complexity coding. These data have been subjected to
analysis for all segments of the sample, and the results are presented
by Dr. Loflin.

Still another program produces lists of every reconstructed
referent within a particular segment, together with a subsumed list
of specific antecedents used for such referents. Frequencies of
occurrence of each referent and their antecedents have been tabulated.
Another program produces information as to the referent count,
cross-classifying with respect to implicit vs. explicit occurrence and also
with respect to pro-form vs. full-form occurrence. Results of analysis
of these data are being presented by Mrs. Keyes and Mr. Guyette.

A dictionary and word count has also been computer-produced
for each portion of the sample.

A third set of programs was devised to produce and collect
information more detailed than simple frequency counts. For instance,
it turned out to be difficult to compare sentences from different parts
of a segment of the data in order to make judgments concerning
similarity of meaning, inasmuch as such sentences might be separated
in time and space by a considerable amount of text. Hence a program
was implemented to sort simplex sentences by person, number, and
gender of nominal items within the simplex, and to output the sorted
sentences. Human judgments about similarity of meaning were much
facilitated by such sorting. Verbs have also been sorted in accordance
with their co-occurrence with animate cases and gender, including a
cross-classification of "self" and "other" reference. This information
is presently under analysis by the authors.
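
The sorting step can be sketched in Python as below. The field names
and code values are hypothetical, and the choice to compare nominals in
their order of appearance is an assumption; the paper states only that
sentences were sorted by person, number, and gender of their nominal
items.

```python
def sort_key(sentence):
    """Key built from the (person, number, gender) codes of each nominal,
    taken in order of appearance within the simplex sentence."""
    return tuple((n["person"], n["number"], n["gender"]) for n in sentence["nominals"])


def sort_simplexes(sentences):
    """Bring together simplex sentences whose nominals share the same codes.

    Python's sort is stable, so sentences with identical codes keep
    their original (chronological) order.
    """
    return sorted(sentences, key=sort_key)
```

After sorting, sentences that were far apart in the transcript but
share the same pattern of nominals appear next to each other, which is
what makes side-by-side human comparison practical.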

Lists of case frames (patterns of cases within simplex sentences)
occurring within the sample have been made, including a count of the
number of times a given case frame occurs, and also a list of the verbs
occurring within such case frames.
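
A case-frame listing of this kind reduces to a counter plus an index of
verbs, as in the following Python sketch. The record layout and the
case labels used in the usage note are hypothetical, and treating a
frame as an ordered sequence of cases is an assumption.

```python
from collections import Counter, defaultdict


def case_frames(simplexes):
    """Count case frames and collect the verbs found in each.

    Each simplex is a dict with a "verb" and a list of "cases"; a frame
    is taken to be the ordered tuple of its cases.
    """
    counts = Counter()
    verbs = defaultdict(set)
    for s in simplexes:
        frame = tuple(s["cases"])
        counts[frame] += 1
        verbs[frame].add(s["verb"])
    return counts, {frame: sorted(v) for frame, v in verbs.items()}
```

With three simplexes coded ("agent", "object") and one coded
("agent",), the counter would report 3 and 1, and the verb index would
list each distinct verb once per frame.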

Finally, an initial attempt to make a sequential analysis of
data is underway. From information produced in the dictionary for
each sample, a short list of content words of high frequency was
selected. To enter the computer with such a list allowed us to derive
a display of loadings for each sentence. A cyclical pattern of such
loadings emerges, when one examines the occurrence of these high
frequency words (see Exhibit C).
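
The two steps described here, selecting high-frequency content words
and then scoring each sentence against that list, can be sketched in
Python. The whitespace tokenization, the stop-word list used to
exclude function words, and the cutoff top_n are assumptions; the paper
does not say how the short list was chosen.

```python
from collections import Counter


def high_frequency_words(sentences, stopwords, top_n):
    """Pick the top_n most frequent words that are not in the stop list."""
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.lower().split()
        if word not in stopwords
    )
    return [word for word, _ in counts.most_common(top_n)]


def loadings(sentences, words):
    """For each sentence, count the occurrences of each listed word."""
    return [[sentence.lower().split().count(w) for w in words] for sentence in sentences]
```

Printing the rows of loadings in sentence order gives a display in
which runs of non-zero rows, followed by rows of zeros, would
correspond to the cyclical pattern mentioned above.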

Work on such sequential analysis is now proceeding. As a first
step in this analysis, we are in process of computing an entropy index
for all the natural sentences. The entropy index was originally
conceived to test a hypothesis about the structural characteristics of
topic units. This hypothesis was derived from the postulate that
subject matter was related to a set of structural (content-free) language
characteristics: i.e., certain pro-form substitutions, implicitness,
and lexical repetition. These indices share the common feature of a
lack of new information, or redundancy. Thus they should load more or
less additively on a common index of entropy. The hypothesis concerning
topic variation stated that speech containing a new topic would be
heralded by a burst of information, and then, sequentially, would be
characterized by an increasing degree of entropy, or lack of new
information, or repetitiveness. More succinctly, structural indices of
language entropy were expected to vary in a cyclical fashion over time,
and the cycles were expected to coincide with semantically-based
judgments of topical units. Preliminarily, this relation seems in fact to
exist; somewhere around 80% is the level of entropy which characterizes
a topic sequence. However, the entropy index has become fascinating
in its own right, independent of topic. We expect to use it to
document sociolinguistic characteristics of sequential speech patterns.

We have produced two versions of such an index. Originally
we attempted to calculate the index by looking at the occurrence of
chosen structural components of each sentence. These included 1)
implicitness of the lexical item, 2) occurrence of pronoun substitution
(that, which, what, etc., and personal pronouns such as he, it, etc.),
3) occurrence of referents which had appeared as explicit items earlier
in the body of data, and 4) absence of "new words" in the lexical
unit. The index was calculated as a ratio: out of the total number
of lexical units which occurred in the natural sentence, what
proportion were "redundant" because of any one or more of the four criteria
above?
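
The ratio can be made concrete with a short Python sketch. It is an
illustration under assumptions, not the original program: the
per-unit fields ("implicit", "pronoun", "referent", "text") are
hypothetical, and the running history is updated after every unit
rather than after every sentence.

```python
def original_index(sentences):
    """Per-sentence ratio of lexical units that meet any of the four criteria.

    `sentences` is a list of natural sentences, each a list of unit
    dicts with keys "text", "implicit", "pronoun", and "referent"
    (None when the unit has no referent).
    """
    seen_words = set()       # every word met so far in the body of data
    seen_referents = set()   # referents that have appeared as explicit items
    ratios = []
    for units in sentences:
        redundant = 0
        for u in units:
            words = u["text"].lower().split()
            criteria = (
                u["implicit"],                                   # 1) implicitness
                u["pronoun"],                                    # 2) pronoun substitution
                u["referent"] in seen_referents,                 # 3) referent seen earlier
                bool(words) and all(w in seen_words for w in words),  # 4) no new words
            )
            redundant += any(criteria)
            # Update the history only after the unit has been scored.
            seen_words.update(words)
            if u["referent"] is not None and not u["implicit"]:
                seen_referents.add(u["referent"])
        ratios.append(redundant / len(units) if units else 0.0)
    return ratios
```

Criteria 3 and 4 depend on everything that came before the current
unit, so both history sets start empty and grow with the length of the
segment being processed.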

The original index was not entirely satisfactory for two reasons.
First, there was a biasing toward the beginning of a selected body of
data as an artifact of the initialization. Second, the storage
required of the computer on a long segment became prohibitive, since
all words already occurring had to be stored. In addition, we desired
to explore the possibility that better prediction of topic change was
possible if the concept of additivity of the components was incorporated.

With the development of the "high frequency word" sentence loadings,
it seemed that we might have here a substitute for the "new word" component
of the entropy index. This has now been incorporated into the program
which calculates the index. Because of this decision, calculation of
the index has become a two-stage process: first the high-frequency words
are selected from the dictionary for that segment of the data, then the
computer checks each item for a redundancy load on each of the four criteria.






The program produces six indices for a given sentence: a ratio of
items which are redundant on each count (of the four criteria), a
proportion of item redundancy on any count, and finally an average
redundancy including all loadings in an additive sense. Data from
these indices are illustrated in Exhibit D.
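
The six indices can be sketched in Python as follows. The criterion
names are hypothetical labels, each loading is assumed to be already
reduced to 0 or 1 per lexical unit, and the additive index is
interpreted here as the equally weighted mean of all loadings; the
paper does not give the exact formula.

```python
# Hypothetical labels for the four criteria; each unit carries a 0/1 loading per label.
CRITERIA = ("implicit", "pro_form", "repeated_referent", "word_load")


def six_indices(units):
    """Return the six redundancy indices for one natural sentence."""
    n = len(units)
    if n == 0:
        raise ValueError("a sentence must contain at least one lexical unit")
    # Four ratios: share of units redundant on each criterion separately.
    indices = {c: sum(u[c] for u in units) / n for c in CRITERIA}
    # Fifth index: share of units redundant on any one or more criteria.
    indices["any"] = sum(any(u[c] for c in CRITERIA) for u in units) / n
    # Sixth index: all loadings added together with equal weights.
    indices["additive"] = sum(u[c] for u in units for c in CRITERIA) / (n * len(CRITERIA))
    return indices
```

The "any" index counts a unit once however many criteria it meets,
while the "additive" index lets a unit that meets several criteria
contribute more, which is the distinction the two summary figures are
meant to capture.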

It may be suggested that in the future some weighting of
various structural components of natural sentences might be used for
various purposes, such as the establishment of units of topic in
classroom discourse. At present, equal weighting has been the only such scheme
investigated, in the absence of any theoretical justification for a
choice of weights on any other basis.

In principle, calculation of redundancy indices (or any other
type of counting or sorting) might be done without computer processing.
However, if we are to use the computer, it is necessary to make explicit
the steps that are involved in such human judgments as counting and
calculation, in order to translate such steps into instructions for the
computer. It is the enormous mass of the data and the repetitive nature
of many of the judgments involved which have dictated our extensive use
of computer processing.

All programs mentioned here are written in PL/I and have been
implemented on the IBM 360-65 at the University of Missouri. It should
be emphasized that these programs have been tailored specifically to the
reconstruction and coding systems which we have devised--that is, they
make use of the formatting and punctuation and labeling conventions used
in our processing of the data. Most of these programs are fast-running
and require computer capacity of 200K or less. Any could be rewritten
for a different reconstruction, coding, or computer system relatively
easily.

We conceive of our language analysis system as applicable not
only to classroom discourse, but to any corpus of linguistic
communication. The universality of the system is one of its greatest advantages.






EXHIBIT A 




EXHIBIT C




